# Global imports and settings
%matplotlib inline
from preamble import *
from IPython.display import display, Image
plt.rcParams['savefig.dpi'] = 100 # Use 300 for PDF, 100 for slides
HTML('''<style>html, body{overflow-y: visible !important} .CodeMirror{min-width:105% !important;} .rise-enabled .CodeMirror, .rise-enabled .output_subarea{font-size:140%; line-height:1.2; overflow: visible;} .output_subarea pre{width:110%}</style>''') # For slides
In this notebook, we will:
We often distinguish 3 types of machine learning:
Note:
2 subtypes: classification and regression.
Most supervised algorithms that we will see can do both.
display(Image('./images/01_supervised.png', width=800))
display(Image('./images/01_classification.png', width=400))
display(Image('./images/01_regression2.png', width=400))
display(Image('./images/01_rl2.png', width=600))
display(Image('./images/01_cluster2.png', width=600))
display(Image('./images/01_dimred.png', width=800))
display(Image('./images/01_terminology.png', width=800))
A typical machine learning system has multiple components:
display(Image('./images/01_ml_systems.png', width=800))
scikit-learn is the most prominent Python library for machine learning:
Supervised learning:
Unsupervised learning:
Model selection and evaluation:
Multiple options:
- `sklearn.datasets`
- `pandas` or `numpy`

Classify types of Iris flowers (setosa, versicolor, or virginica) based on the sizes of the flower's sepals and petals.
display(Image('https://www.math.umd.edu/~petersd/666/html/iris_with_labels.jpg', width=500))
Iris is included in scikit-learn, so we can just load it.
This will return a Bunch object (similar to a dict)
from sklearn.datasets import load_iris
iris_dataset = load_iris()
print("Keys of iris_dataset: {}".format(iris_dataset.keys()))
print(iris_dataset['DESCR'][:193] + "\n...")
The target (class) names and feature names are stored as lists; the data itself is an ndarray.
print("Targets: {}".format(iris_dataset['target_names']))
print("Features: {}".format(iris_dataset['feature_names']))
print("Shape of data: {}".format(iris_dataset['data'].shape))
print("First 5 rows:\n{}".format(iris_dataset['data'][:5]))
The targets are stored separately as an ndarray of integers, where each value is an index into the list of target names.
print("Target names: {}".format(iris_dataset['target_names']))
print("Targets:\n{}".format(iris_dataset['target']))
All scikit-learn classifiers follow the same interface.
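Every estimator exposes the same `fit`/`predict`/`score` methods. As an illustration of that contract (a toy stand-in written for this explanation, not part of scikit-learn), here is a minimal classifier that always predicts the majority class:

```python
import numpy as np

class MajorityClassifier:
    """Toy classifier following scikit-learn's estimator convention:
    fit(X, y) learns from data, predict(X) returns labels,
    score(X, y) returns accuracy."""

    def fit(self, X, y):
        # "Learn" the most frequent class among the training labels
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self  # fit returns self, so calls can be chained

    def predict(self, X):
        # Predict the majority class for every example
        return np.full(len(X), self.majority_)

    def score(self, X, y):
        # Fraction of correct predictions (accuracy)
        return np.mean(self.predict(X) == y)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 1, 1, 1])
clf = MajorityClassifier().fit(X, y)
print(clf.predict(X))   # every prediction is the majority class, 1
print(clf.score(X, y))  # 0.75
```

Because all scikit-learn classifiers share this interface, swapping one model for another only changes the line that constructs the estimator.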
To evaluate our classifier, we need to test it on unseen data.
train_test_split: splits the data randomly into 75% training and 25% test data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
iris_dataset['data'], iris_dataset['target'],
random_state=0)
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))
Note: there are several problems with this approach that we will discuss later:
We can use a library called pandas to easily visualize our data. Note how several features allow us to cleanly separate the classes.
# Build a DataFrame with training examples and feature names
iris_df = pd.DataFrame(X_train,
columns=iris_dataset.feature_names)
# scatter matrix from the dataframe, color by class
sm = pd.plotting.scatter_matrix(iris_df, c=y_train, figsize=(10, 10),
                                marker='o', hist_kwds={'bins': 20}, s=60,
                                alpha=.8, cmap=mglearn.cm3)
The first model we'll build is called k-Nearest Neighbor, or kNN. More about that soon.
kNN is included in sklearn.neighbors, so let's build our first model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
Let's create a new example and ask the kNN model to classify it
X_new = np.array([[5, 2.9, 1, 0.2]])
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
iris_dataset['target_names'][prediction]))
Feeding all test examples to the model yields all predictions
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))
We can now just count what percentage was correct
print("Score: {:.2f}".format(np.mean(y_pred == y_test)))
The score function does the same thing (by default)
print("Score: {:.2f}".format(knn.score(X_test, y_test)))
display(Image('http://scikit-learn.org/stable/_images/sphx_glr_plot_underfitting_overfitting_001.png', width=700))
In all supervised algorithms that we will discuss, we'll cover:
for k=1: return the class of the nearest neighbor
mglearn.plots.plot_knn_classification(n_neighbors=1)
for k>1: do a vote and return the majority (or a confidence value for each class)
mglearn.plots.plot_knn_classification(n_neighbors=3)
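For intuition, the majority vote can be sketched from scratch with NumPy (an illustrative toy, not scikit-learn's actual implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two well-separated clusters of three points each
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # → 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))  # → 1
```

Note that "fitting" kNN amounts to storing the training data; all the work happens at prediction time.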
Let's build a kNN model for this dataset (called 'Forge')
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
print("Test set accuracy: %.2f" % clf.score(X_test, y_test))
We can plot the prediction for each possible input to see the decision boundary
fig, axes = plt.subplots(1, 3, figsize=(10, 3))
for n_neighbors, ax in zip([1, 3, 9], axes):
clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, y)
mglearn.plots.plot_2d_separator(clf, X, fill=True, eps=0.5, ax=ax, alpha=.4)
mglearn.discrete_scatter(X[:, 0], X[:, 1], y, ax=ax)
ax.set_title("{} neighbor(s)".format(n_neighbors))
ax.set_xlabel("feature 0")
ax.set_ylabel("feature 1")
_ = axes[0].legend(loc=3)
Using few neighbors corresponds to high model complexity (left); using many neighbors corresponds to low model complexity and a smoother decision boundary (right).
We can more directly measure the effect of k on the training and test accuracy using a larger dataset (breast_cancer).
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
cancer.data, cancer.target, stratify=cancer.target, random_state=66)
# Build a list of the training and test scores for increasing k
training_accuracy = []
test_accuracy = []
k = range(1, 11)
for n_neighbors in k:
# build the model
clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_train, y_train)
# record training and test set accuracy
training_accuracy.append(clf.score(X_train, y_train))
test_accuracy.append(clf.score(X_test, y_test))
plt.plot(k, training_accuracy, label="training accuracy")
plt.plot(k, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
_ = plt.legend()
For small numbers of neighbors, the model is too complex, and overfits the training data. As more neighbors are considered, the model becomes simpler and the training accuracy drops, yet the test accuracy increases, up to a point. After about 8 neighbors, the model starts becoming too simple (underfits) and the test accuracy drops again.
for k=1: return the target value of the nearest neighbor
mglearn.plots.plot_knn_regression(n_neighbors=1)
for k>1: return the mean of the target values of the k nearest neighbors
mglearn.plots.plot_knn_regression(n_neighbors=9)
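The regression variant simply replaces the vote with an average. A minimal from-scratch sketch (illustrative only, not scikit-learn's implementation):

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # average the targets of the k nearest training points
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
# The 3 nearest neighbors of x=1.0 have targets 0, 1 and 2
print(knn_regress(X_train, y_train, np.array([1.0]), k=3))  # → 1.0
```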
To do regression, simply use KNeighborsRegressor instead
from sklearn.neighbors import KNeighborsRegressor
X, y = mglearn.datasets.make_wave(n_samples=40)
# split the wave dataset into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Instantiate the model, set the number of neighbors to consider to 3:
reg = KNeighborsRegressor(n_neighbors=3)
# Fit the model using the training data and training targets:
reg.fit(X_train, y_train)
The default scoring function for regression models is $R^{2}$ (the coefficient of determination). It measures how much of the variance in the target is explained by the model: 1 means a perfect fit, 0 means the model does no better than always predicting the mean, and it can even be negative for very poor models.
print("Test set predictions:\n{}".format(reg.predict(X_test)))
print("Test set R^2: {:.2f}".format(reg.score(X_test, y_test)))
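To see what the score means, $R^2$ can also be computed by hand from its definition, $R^2 = 1 - SS_{res}/SS_{tot}$ (the small arrays below are made up for illustration):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# residual sum of squares: error of the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)
# total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(r2)  # → 0.98
```

If `ss_res` exceeds `ss_tot`, the model is worse than the mean baseline and $R^2$ goes negative.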
We can again output the predictions for each possible input, for different values of k.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# create 1000 data points, evenly spaced between -3 and 3
line = np.linspace(-3, 3, 1000).reshape(-1, 1)
for n_neighbors, ax in zip([1, 3, 9], axes):
# make predictions using 1, 3 or 9 neighbors
reg = KNeighborsRegressor(n_neighbors=n_neighbors)
reg.fit(X_train, y_train)
ax.plot(line, reg.predict(line))
ax.plot(X_train, y_train, '^', c=mglearn.cm2(0), markersize=8)
ax.plot(X_test, y_test, 'v', c=mglearn.cm2(1), markersize=8)
ax.set_title(
"{} neighbor(s)\n train score: {:.2f} test score: {:.2f}".format(
n_neighbors, reg.score(X_train, y_train),
reg.score(X_test, y_test)))
ax.set_xlabel("Feature")
ax.set_ylabel("Target")
_ = axes[0].legend(["Model predictions", "Training data/target",
"Test data/target"], loc="best")
Again we see that a small k leads to an overly complex (overfitting) model, while a larger k yields a smoother fit.
Conclusions:
We met our first algorithm (kNN)
Next lectures: